Groupe 13 : Acciari Claire, Blart Louise
Article : Bayesian Community Detection - S. L. van der Pas, A. W. van der Vaart

Introduction

The United States National Research Council defines Network Science as "the study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena". It is a modern academic field that studies complex networks, considering distinct elements, represented by nodes (or vertices) and connected by links (or edges). An edge typically connects a pair of vertices.
Community structure is one of the most widely studied structures in network data: the nodes are organised into groups called communities (or clusters). Identifying such sub-groups provides important insights into how information flows through the network and how its topology shapes this process.

Community detection has the history of a "Sleeping Beauty" in science. In fact, the first to discuss community structure was Zachary W.W. in 1977 [1]. Yet this article was very little cited before 2002 and the publication of the first algorithm for community detection by M. Girvan and M. E. J. Newman [2].
A wide variety of methods have been proposed to estimate the latent community membership of nodes. These methods can be classified as in the User Guide of Santo Fortunato [3]:

Community detection in networks is an ill-defined problem: there is no universal definition of a community, and consequently no clear-cut guidelines on how to assess the performance of different algorithms or compare them with each other. Traditional definitions of community rely on counting edges, whereas the modern view relies on probabilities. In the latter view, the existence of communities implies that vertices interact more strongly with the other members of their community than they do with vertices of other communities. Consequently, there is a preferential linking pattern between vertices of the same group, which the model has to take into account. The most famous model of networks with group structure is the stochastic block model (SBM) proposed by Fienberg and Wasserman 1981 [8].

The remainder of the paper is organized as follows. In Section 0, we introduce the database and the notation. In Section 1, we analyse two community detection algorithms: a spectral clustering algorithm and the Girvan-Newman algorithm. In Section 2, we focus on the paper Bayesian Community Detection - S.L. Van Der Pas, A.W. Van Der Vaart - 2016 [9]. In Section 3, we establish methods to estimate the number of communities.

0 - Notations and Database

0.1 - Stochastic Block Model ( SBM)

Let G = (V,E) be an undirected graph with vertex (or node) set $V = ( v_1, ... , v_n)$ and edge set E.

The most famous model of networks with group structure is the stochastic block model (SBM) proposed by Holland et al. 1983 [10]; Snijders and Nowicki, 1997 [11].
Suppose we have an undirected random graph G with n nodes numbered 1, 2, ..., n, and K $\in$ {1, 2, ...} classes. The network can be represented by an $n \times n$ adjacency matrix $A \in \{0, 1\}^{n \times n}$ which encodes the relation between nodes i and j: $A_{ij} = 1$ when $(i,j) \in E$. We suppose that self-loops are not allowed, so $A_{ii} = 0$ for i = 1, ..., n. The generative model for the random graph is :
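As a minimal sketch of this generative model (assuming numpy; the function name `sample_sbm` is ours), one can draw a labelling and then the adjacency matrix:

```python
import numpy as np

def sample_sbm(n, pi, P, seed=None):
    """Draw labels z_i ~ Categorical(pi), then A_ij ~ Bernoulli(P[z_i, z_j])
    independently for i < j; the matrix is symmetrised, with no self-loops."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)               # community labels
    U = rng.random((n, n))                              # uniform draws
    A = np.triu((U < P[z[:, None], z[None, :]]).astype(int), 1)
    return z, A + A.T                                   # A_ii = 0 by construction

# toy example: two assortative communities
pi = np.array([0.5, 0.5])
P = np.array([[0.8, 0.1],
              [0.1, 0.8]])
z, A = sample_sbm(50, pi, P, seed=0)
```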

0.2 - Data

In this paper, we will illustrate our methods using different databases:

Karate Club network

First, we list the members of the Officer club and of the Mr Hi club. We also store the club of each member (H for Mr Hi, O for Officer) to facilitate comparisons with the predictions of the following algorithms.

Email-Eu-core network

1 - Basic algorithms of community detection

To introduce the concept of community detection, we first implement the algorithm studied in the course: a spectral clustering algorithm. Spectral clustering is a popular technique going back to Donath and Hoffman (1973) [12] and Fiedler (1973) [13].
Spectral clustering methods have seen an explosive development and proliferation over the past few years, thanks to Shi & Malik (2000) [14] and Ng, Jordan & Weiss (2002) [15]. Contrary to traditional k-means clustering, which is based on convex sets, spectral clustering makes no assumption on the shape of the clusters. As a result, it can solve very general problems like intertwined spirals.
Spectral graph clustering is an approach to detect clusters using spectral properties of the graph (Fortunato, 2010 [3]).
The eigenvalue spectrum of several graph matrices typically consists of a dense bulk of closely spaced eigenvalues, plus some outlying eigenvalues. The eigenvectors corresponding to these outliers contain information about the large-scale structure of the network, like community structure.
Spectral clustering consists in generating a projection of the graph vertices in a metric space, by using the entries of those eigenvectors as coordinates. The i-th entries of the eigenvectors are the coordinates of vertex i in a k-dimensional Euclidean space, where k is the number of eigenvectors used.
The resulting points can be grouped into clusters by using standard partitional clustering techniques like k-means clustering (MacQueen, 1967) [16].

According to Verma & Meila (2003) [17], this algorithm is efficient under a few assumptions:

Some advantages of this method are noted by von Luxburg (2006) [18]:

1.1 - Unnormalized Spectral Clustering

Spectral clustering is a three-step algorithm:

Step 1 : The Laplacian Matrix L = D - A

An overview of many of its properties can be found in Mohar (1997) [19]. The following proposition from the course summarizes the most important facts needed for spectral clustering:

Proposition 1 : Let G = (V,E,W) be an undirected weighted graph, with $W_{ij} \ge 0$ for all $(i,j) \in V \times V$. We have :
1) For every vector $f \in R^{|V|}$ , $f^T L f = \frac{1}{2} \sum_{(i,j) \in E} W_{ij} ( f_i - f_j)^2 $
2) L is symmetric and positive semi-definite
3) The smallest eigenvalue of L is 0

Another proposition from the course will be important for spectral clustering :

Proposition 2 : Let G be an undirected graph with non-negative weights. Then the multiplicity k of the eigenvalue 0 of L equals the number of connected components $V_1, ..., V_k $ in the graph G. The eigenspace of eigenvalue 0 is spanned by the indicator vectors $\mathbb 1 _{V_1} , ... , \mathbb 1 _{V_k} $ of those components.
We refer to the course for the proof of these two propositions.
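Proposition 2 can be checked numerically on a small example (a sketch assuming numpy; the toy graph is ours): for a graph made of two disjoint triangles, the eigenvalue 0 of L = D - A has multiplicity 2.

```python
import numpy as np

# a graph with two connected components: two disjoint triangles
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A           # unnormalized Laplacian L = D - A
eigvals = np.linalg.eigvalsh(L)          # real and sorted, L is symmetric

multiplicity = int(np.sum(np.isclose(eigvals, 0.0)))
print(multiplicity)  # → 2, the number of connected components
```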

Step 2 : eigenvalue and eigenvectors

We calculate the k eigenvectors ($v_1, ..., v_k$) of L corresponding to the k smallest eigenvalues ( $\lambda_1 \le ... \le \lambda_k$ ).

The Laplacian matrix is symmetric ( $L^T = L$ ) and positive semi-definite (its smallest eigenvalue is 0), as stated in Proposition 1.

Step 3 : K-Means Clustering

We perform a k-means on the matrix containing the k eigenvectors in columns:

Given a set of points $(x_1, x_2, …, x_n)$, we want to partition the n points (the nodes) into k sets $S = \{S_1, S_2, …, S_k\} $ $(k \le n)$ by minimizing the distance between points inside each partition:

$$ {\displaystyle {\underset {\mathbf {S} }{\operatorname {arg\,min} }}\sum _{i=1}^{k}\sum _{\mathbf {x} _{j}\in S_{i}}\left\|\mathbf {x} _{j}-{\boldsymbol {\mu }}_{i}\right\|^{2}} $$

where $\mu_i$ is the barycenter of the points in $S_i$.

To select the number of clusters, we use the elbow method, as proposed in the course. We run k-means for various numbers of clusters and plot the associated inertia (sum of squared distances of samples to their closest cluster center).
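The three steps above, together with the elbow curve, can be sketched as follows (assuming numpy; we use a simple hand-rolled Lloyd k-means with farthest-point initialisation rather than a library implementation):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain Lloyd iterations; farthest-point init keeps initial centers distinct."""
    centers = [X[0]]
    for _ in range(1, k):
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])              # farthest point so far
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                       # assign to nearest center
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, inertia

def spectral_clustering(A, k):
    """Steps 1-3: Laplacian, k eigenvectors of the k smallest eigenvalues, k-means."""
    L = np.diag(A.sum(1)) - A
    _, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return kmeans(vecs[:, :k], k)      # embed each node in R^k

# elbow curve on a toy graph made of two disjoint complete graphs K4
A = np.kron(np.eye(2, dtype=int), 1 - np.eye(4, dtype=int))
inertias = [spectral_clustering(A, k)[1] for k in range(1, 5)]
```

On this toy graph the inertia drops to (numerically) zero at k = 2, the true number of components.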

With the elbow method we can see that spectral clustering detects 2 or 3 communities.

Without any normalization of the Laplace matrix, classification is difficult for seven individuals, who are classified in community 1 although they belong to club 0.

With a bigger network, we can see that this spectral algorithm has difficulty detecting the departments.

1.2 - Normalized spectral Clustering

Often, spectral clustering is also performed with a normalized version of the Laplacian matrix. A common way of normalizing L is $L' = D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$. The first argument in favor of normalized spectral clustering, due to von Luxburg [4], comes from the graph partitioning point of view. In general, clustering has two different objectives:

Normalized spectral clustering implements both clustering objectives mentioned above, while unnormalized spectral clustering only implements the first objective.

The second argument is the convergence of the normalized graph Laplacian. von Luxburg, Bousquet and Belkin (2004) [20] show that normalized spectral clustering usually converges to an intuitively appealing limit partition of the data space. By contrast, in the case of the unnormalized graph Laplacian, equally strong convergence results are difficult to obtain.
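A short sketch of the symmetric normalisation $D^{-1/2} (D - A) D^{-1/2}$ (assuming numpy; for the complete graph K4 its eigenvalues are 0 and 4/3, all lying in [0, 2]):

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalisation D^{-1/2} (D - A) D^{-1/2}; isolated nodes kept safe."""
    d = A.sum(1)
    inv_sqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1)))
    return inv_sqrt @ (np.diag(d) - A) @ inv_sqrt

A = 1 - np.eye(4, dtype=int)                 # complete graph K4, degrees all 3
eigvals = np.linalg.eigvalsh(normalized_laplacian(A))
```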

With a normalization of the Laplace matrix, classification is less difficult: only 2 persons are misclassified.

The normalised algorithm performs better on the e-mail network.

1.3 - Girvan Newman algorithm

Now that community detection has been introduced through spectral clustering, we introduce the concept of modularity. Girvan and Newman 2004 [21] were the first to introduce modularity, together with the Girvan-Newman algorithm.

This algorithm is based on a divisive method: it first looks for the edges in the network that are most "between" other vertices. Betweenness is a measure that favors edges lying between communities and disfavors those lying inside communities. There are several ways to compute this measure, such as shortest-path betweenness, random-walk betweenness, or current-flow betweenness, but we focus on shortest-path betweenness. It is defined as the number of shortest paths in the graph that pass through the node or edge, divided by the total number of shortest paths.

For the detection and analysis of community structure, the Girvan-Newman algorithm relies on the iterative elimination of the edges with the highest betweenness. By removing edges from the graph one by one, the network breaks down into smaller pieces, the communities. Note that the method includes a "recalculation step": the betweenness measure is recalculated after the removal of each edge.
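A self-contained sketch of one elimination step for small unweighted graphs (the betweenness computation follows Brandes' algorithm; function names and the toy graph are ours):

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted graph (Brandes' algorithm)."""
    eb = {tuple(sorted((u, v))): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        sigma = {v: 0 for v in adj}; sigma[s] = 1      # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                                   # BFS from source s
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                      # back-propagate dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                eb[tuple(sorted((v, w)))] += c
                delta[v] += c
    return eb

def girvan_newman_step(adj):
    """Remove the single edge of highest betweenness; recalculated on the next call."""
    eb = edge_betweenness(adj)
    u, v = max(eb, key=eb.get)
    adj[u].remove(v); adj[v].remove(u)
    return (u, v)

# two triangles joined by a bridge: the bridge (2, 3) is removed first
adj = {i: set() for i in range(6)}
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    adj[u].add(v); adj[v].add(u)
removed = girvan_newman_step(adj)
```

After that single step the graph splits into the two triangle communities.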

Thus, the general form of their community structure finding algorithm is as follows:

Then a new problem arises: when can we stop the algorithm? Or, equivalently, how do we know that the communities found by the algorithm are good ones? To answer these questions, Girvan and Newman define a measure of the quality of a particular division of a network: the modularity.

Consider a particular division of a network into k communities. Let us define a $k \times k$ symmetric matrix $e$ whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in community i to vertices in community j. The trace of this matrix $Tr e = \sum_i e_{ii} $ gives the fraction of edges in the network that connect vertices in the same community, and clearly a good division into communities should have a high value of this trace. The trace on its own, however, is not a good indicator of the quality of the division since, for example, placing all vertices in a single community would give the maximal value of $Tr e = 1$ while giving no information about community structure at all. So we further define the row (or column) sums $a_i = \sum_j e_{ij} $, which represent the fraction of edges that connect to vertices in community i. In a network in which edges fall between vertices without regard for the communities they belong to, we would have $e_{ij} = a_ia_j $. Thus we can define a modularity measure by $$ Q = \sum_i (e_{ii} - a_i^2) = Tr e - ||e^2|| $$
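The matrix $e$, the row sums $a_i$ and $Q$ can be computed directly from an adjacency matrix and a labelling (a sketch assuming numpy; for two disjoint triangles labelled by their component, $Q = 2 \times (0.5 - 0.25) = 0.5$):

```python
import numpy as np

def ng_modularity(A, labels):
    """Q = sum_a (e_aa - a_a^2), where e_ab is the fraction of edge ends joining
    community a to community b (each undirected edge contributes two ends)."""
    k = int(labels.max()) + 1
    two_m = A.sum()                                  # = 2 * number of edges
    e = np.array([[A[np.ix_(labels == a, labels == b)].sum() / two_m
                   for b in range(k)] for a in range(k)])
    a_sums = e.sum(axis=1)                           # row sums a_i
    return float(np.trace(e) - (a_sums ** 2).sum())

# two disjoint triangles, labelled by their component
A = np.kron(np.eye(2, dtype=int), np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))
labels = np.array([0, 0, 0, 1, 1, 1])
print(ng_modularity(A, labels))  # → 0.5
```

Note that putting every vertex in one community gives Q = 0, as discussed above.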

The modularity increases sharply for 2 communities (0.38), then grows more slowly and reaches its maximum for k = 5 (modularity = 0.4).

Two persons are misclassified.

2 - Bayesian modularity

2.1 - Priors of the model

In this section, we assume $K$ to be known and we consider priors on the parameters of the Stochastic Block Model.

The priors on $\pi$ - the probabilities associated with each community - and on every element of the matrix $P$ - the probabilities of an edge between communities - are taken to be independent.

$$ \pi \sim \text{Dirichlet}(\alpha, ..., \alpha) \qquad \text{with} \quad \alpha > 0 $$

$$ P_{ab} \sim \text{Beta}(\beta_1, \beta_2) \qquad \text{for} \quad 1 \leq a \leq b \leq K \qquad \text{with} \quad \beta_1, \beta_2 > 0 $$

The Bayesian model is completed with priors on the labellings, denoted by $e = (e_1, ..., e_n)$, and on the edges, encoded by the adjacency matrix $A$:

$$ e_i | \pi, P \overset{iid}{\sim} \pi \qquad \text{for} \quad 1 \leq i \leq n $$

$$ A_{ij} | \pi, P, e \overset{ind}{\sim} \text{Bernoulli}(P_{e_i, e_j}) \qquad 1 \leq i < j \leq n $$

We introduce the following quantities: the number of nodes in a community, the number of edges between two communities, and the maximum number of possible edges between two communities:

$$ n_a(e) = \sum_{i=1}^n \mathbb{1}(e_i=a) $$

$$ O_{ab}(e) = \sum_{i,j} A_{ij} \mathbb{1}(e_i=a, e_j=b) \quad \text{if} \; a \neq b \qquad O_{ab}(e) = \sum_{i<j} A_{ij} \mathbb{1}(e_i=a, e_j=b) \quad \text{if} \; a = b $$

$$ n_{ab}(e) = n_a(e)n_b(e) \quad \text{if} \; a \neq b \qquad n_{ab}(e) = \frac{n_a(e)(n_a(e)-1)}{2} \quad \text{if} \; a = b $$
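These counts can be sketched directly from the adjacency matrix (assuming numpy; the helper name `sbm_counts` and the toy graph are ours):

```python
import numpy as np

def sbm_counts(A, e, K):
    """n_a(e), O_ab(e) and n_ab(e) as defined above, for labels e in {0, ..., K-1}."""
    n_a = np.array([(e == a).sum() for a in range(K)])
    O = np.zeros((K, K), dtype=int)
    n_ab = np.zeros((K, K), dtype=int)
    for a in range(K):
        for b in range(K):
            block = A[np.ix_(e == a, e == b)].sum()
            if a == b:
                O[a, a] = block // 2               # each within-block edge counted twice
                n_ab[a, a] = n_a[a] * (n_a[a] - 1) // 2
            else:
                O[a, b] = block
                n_ab[a, b] = n_a[a] * n_a[b]
    return n_a, O, n_ab

# two triangles joined by one bridge, with the natural labelling
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
n_a, O, n_ab = sbm_counts(A, np.array([0, 0, 0, 1, 1, 1]), 2)
```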

Now that the model and priors are specified, we can present the Bayesian modularity.

2.2 - Introducing Bayesian modularity

The Bayesian estimator in the case of community detection consists of maximising the posterior distribution of the labelling: $$ \hat{e} = \operatorname{argmax}_e \; p(e|A) $$

This posterior mode is obtained by marginalising the likelihood of the model: $ L(e, A, P, \pi) = \prod_{a\leq b} P_{ab}^{O_{ab}(e)} (1-P_{ab})^{n_{ab}(e) - O_{ab}(e)} \prod_a \pi_a^{n_a(e)} $

Collapsing can be used to give a more convenient and efficient expression for the model. This refers to the integration of nuisance parameters out of the model. The SBM was partially collapsed by Kemp et al. (2004) [22], but we will consider the full collapsing of both $P$ and $\pi$. As our primary interest is in the clustering $e$, we integrate out $P$ and $\pi$, yielding an explicit expression for the marginal $p(A, e \mid K)$. We emphasize that integration does not change the model; it merely yields a more convenient representation of the relevant parts of the posterior. This integration is made possible by the choice of conjugate priors for $P$ and $\pi$.

By integrating the likelihood with regards to $(\pi, P)$ independent from one another, we have :

$ \int_{[0,1]^{K(K+1)/2}} \int_{S_K} L(e, A, P, \pi) f_P(P) f_\pi(\pi) d\pi dP = \left[ \prod_{a\leq b} \int_0^1 P_{ab}^{O_{ab}(e) + \beta_1 -1} (1-P_{ab})^{n_{ab}(e) - O_{ab}(e) + \beta_2 - 1} \frac{1}{\mathcal{B}(\beta_1, \beta_2)} dP_{ab} \right] \left[ \int_{S_K} \prod_a \pi_a^{n_a(e) + \alpha - 1} \frac{1}{\mathcal{D}(\alpha)} d\pi \right] $

$\qquad = \left[ \prod_{a\leq b} \frac{\mathcal{B}(\beta_1 + O_{ab}(e), \beta_2 + n_{ab}(e) - O_{ab}(e))}{\mathcal{B}(\beta_1, \beta_2)} \right] \left[ \frac{\mathcal{D}((\alpha + n_a(e))_a)}{\mathcal{D}(\alpha)} \right] $

$\qquad = \underbrace{ \left[ \prod_{a\leq b} \frac{\mathcal{B}(\beta_1 + O_{ab}(e), \beta_2 + n_{ab}(e) - O_{ab}(e))}{\mathcal{B}(\beta_1, \beta_2)} \right]}_{(1)} \underbrace{ \left[ \frac{\Gamma(\alpha K)}{\Gamma(\alpha)^K \Gamma(n+\alpha K)} \prod_a \Gamma(n_a(e) + \alpha) \right]}_{(2)} $

Because the two priors are conjugate, we obtain a closed form for this marginalised density. We observe that $(1)$ depends on both $A$ and $e$, whereas $(2)$ only depends on $e$.

We can now introduce the Bayesian modularity by taking the logarithm of that joint density, multiplying it by $n^{-2}$ and omitting the part that does not depend on e: $$ Q_B(e) = \frac{1}{n^2} \sum_{1\leq a \leq b \leq K} log(\mathcal{B}(\beta_1 + O_{ab}(e), \beta_2 + n_{ab}(e) - O_{ab}(e))) + \frac{1}{n^2} \sum_{a=1}^K log(\Gamma(n_a(e) + \alpha)) $$
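Given the counts $n_a(e)$, $O_{ab}(e)$ and $n_{ab}(e)$, evaluating $Q_B$ only requires log-Gamma values, available in the standard library. A sketch (the hyperparameters $\alpha = \beta_1 = \beta_2 = 1$ and the toy counts, taken from the two-triangles-plus-bridge graph, are our illustrative choices):

```python
from math import lgamma

def betaln(x, y):
    """log of the Beta function B(x, y), via log-Gamma."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def bayesian_modularity(n_a, O, n_ab, alpha=1.0, beta1=1.0, beta2=1.0):
    """Q_B(e) = n^{-2} [ sum_{a<=b} log B(beta1 + O_ab, beta2 + n_ab - O_ab)
                         + sum_a log Gamma(n_a + alpha) ]."""
    K, n = len(n_a), sum(n_a)
    q = sum(betaln(beta1 + O[a][b], beta2 + n_ab[a][b] - O[a][b])
            for a in range(K) for b in range(a, K))
    q += sum(lgamma(na + alpha) for na in n_a)
    return q / n ** 2

# two triangles + bridge: the natural split scores higher than a mixed labelling
split = bayesian_modularity([3, 3], [[3, 1], [0, 3]], [[3, 9], [0, 3]])
mixed = bayesian_modularity([3, 3], [[1, 5], [0, 1]], [[3, 9], [0, 3]])
```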

This function can be seen as a modularity as it depends on the number of edges drawn ($O_{ab}(e)$) and the maximum number of edges that could be drawn ($n_{ab}(e)$) between different labellings, thus measuring the strength of division of a network into modules.

Now, because $p(e|A) = \frac{p(A, e)}{p(A)}$, maximising $p(A,e)$ is equivalent to maximising $p(e|A)$. Thus, because $Q_B$ is a monotonically increasing transform of $p(A,e)$, we can write the maximisation problem as: $$ \hat{e} = argmax_e Q_B(e) $$

We will now apply this MAP estimation to a toy example : the Karate Club network. This example has also been treated in the article and we wish to reproduce its main findings.

In this first chunk of code, we compute the quantities needed for the Bayesian modularity.

In the article, the maximisation of the Bayesian modularity - a non-convex function - over the set of possible node labellings with exactly K classes - a non-convex set - was carried out with the Tabu Search meta-heuristic.

This algorithm explores the space from near to far while keeping track of the solutions it has already explored. One of its particularities is this aspect of a memory and of "tabu" solutions that enables the algorithm to explore the space and even force it to explore a priori less promising regions. It can thus escape local optima thanks to that mechanism. It is however very computationally expensive, and other methods are available for larger datasets.

The following algorithm implements the Tabu Search method. As it was done in the original article, we will use that heuristic to explore the space of solutions and return the optimal solution with regards to the Bayesian modularity.

In order to use the Tabu Search algorithm, we need to define the neighbourhood of a specific labelling in the space of possible labellings. As this was not specified in the article, we had to choose how to define that notion for our Tabu Search. This might impact our ability to find the actual optimum, as it affects the way the algorithm moves through the space of solutions.

We decided to define the neighbourhood of a labelling for a given graph as the set of labellings resulting from a vertex move on the original labelling. Note that the labellings resulting from a vertex move are then filtered to remain in the space of solutions, i.e. labellings containing K distinct classes. A vertex move corresponds to changing the label of a vertex into the label of one of its neighbours. Thus, the neighbourhood we define is a set containing labellings identical to the one considered except for one vertex whose label has changed. To speed up the search process, we do not consider the vertex moves of every vertex of the graph but only 50% of them.
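The article gives no pseudocode for the search, so the following is our minimal sketch of a Tabu Search with vertex-move neighbourhoods, demonstrated on a toy objective (agreement with a fixed target labelling) rather than on $Q_B$ itself:

```python
import random

def vertex_move_neighbours(labelling, rng, K=2, frac=1.0):
    """All labellings one vertex move away, restricted to a random fraction of
    the vertices and filtered to keep exactly K distinct classes."""
    out = []
    m = max(1, int(frac * len(labelling)))
    for i in rng.sample(range(len(labelling)), m):
        for new in range(K):
            if new != labelling[i]:
                cand = list(labelling)
                cand[i] = new
                if len(set(cand)) == K:
                    out.append(tuple(cand))
    return out

def tabu_search(score, init, neighbours, n_iter=100, tabu_size=50, seed=0):
    """Move to the best non-tabu neighbour at each step; keep the best solution seen."""
    rng = random.Random(seed)
    current = best = tuple(init)
    best_score = score(best)
    tabu = [current]                       # memory of recently visited labellings
    for _ in range(n_iter):
        cands = [c for c in neighbours(current, rng) if c not in tabu]
        if not cands:
            break                          # whole neighbourhood is tabu
        current = max(cands, key=score)    # may be worse than current: escapes local optima
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)                    # forget the oldest tabu entry
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best, best_score

target = (0, 0, 1, 1)
agreement = lambda e: sum(x == y for x, y in zip(e, target))  # toy objective
best, s = tabu_search(agreement, (0, 1, 0, 1), vertex_move_neighbours)
```

In our experiments the score function is the Bayesian modularity and `frac` is set to 0.5, as described above.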

We will first work on the karate club graph, the same one considered in the article.

A representation of the graph and of the club separation groups is available below.

We first hope to be able to reproduce the results from the article. To do so we run the Tabu Search on a specific initial labelling.

We will see that the algorithm is sensitive to the initialisation and we will run our Bayesian modularity maximisation algorithm on two different initial labellings to illustrate that.

As mentioned in the article, we observe that the solution found by the algorithm depends on the initial labelling we start from.

To address this issue, following the article, we launch multiple Tabu Searches, each with a different initial labelling, and keep only the result of the best performing one with regards to the Bayesian modularity.

After this step, we recover the exact solution obtained in the article.

This solution is very different from the classification obtained by looking at the separation of the club between Mr. Hi and the Officer. We can check that the modularity of that classification is not higher than that of our solution.

The club separation is not optimal with regards to the Bayesian modularity maximisation. However, this particular classification is the one returned by most algorithms. The Bayesian modularity thus captures a different aspect of communities in graphs than usual algorithms. Here it grouped the nodes with the highest degrees, corresponding to the club leaders and their closest supporters.

We will now see the effect of increasing the number of communities considered.

We obtain a slightly different result from the one presented in the paper. It is however mentioned by the authors that, due to the heuristic nature of the Tabu Search, the solution presented in the paper might not even be the optimal solution with regards to maximising the Bayesian modularity.

With this choice of K=4 communities, we almost recover the club separation classification, isolating the leaders and their close supporters, except for the fact that nodes 8 and 9 are misclassified.

To make sure that the misclassification of nodes 8 and 9 is not due to the limits of the Tabu Search, we check that this configuration does in fact achieve a higher Bayesian modularity.

Misclassifying nodes 8 and 9 does in fact improve the Bayesian modularity; the result is not due to an under-explored region of the solution space.

We have seen in this part that Bayesian modularity, while not agreeing with more classical community detection approaches, does provide insight into the network architecture. Rather than expressing the social sense of communities, it focuses more on the similarities in the way nodes form connections with others. Overall, there is no strong reason to believe that the two karate clubs do in fact constitute communities in the stochastic sense.

Moreover, the result for four communities enables us to almost recover the club separation while still setting apart the club leaders from their followers. This point of view provides great insight into the hierarchy at play in the network, and the Bayesian modularity is in that respect more informative than usual approaches.

Other modularity approaches have been proposed; we present one of them in the next section.

2.3 - A comparison with Likelihood modularity

In this section, we compare the Bayesian modularity with the Likelihood modularity.

This criterion was obtained by Bickel and Chen (2009) by considering only the logarithm of the part of the likelihood that depends on $A$, i.e. the first product in the likelihood, and replacing the probabilities $P_{ab}$ by their maximum likelihood estimators $\hat{P}_{ab} = \frac{O_{ab}(e)}{n_{ab}(e)}$. This gives:

$ \qquad Q_{ML}(e) = \frac{1}{n^2} log(\prod_{a\leq b} \hat{P}_{ab}^{O_{ab}(e)} (1 - \hat{P}_{ab})^{n_{ab}(e) - O_{ab}(e)}) \\ \qquad \qquad \; \; \; = \frac{1}{n^2} \sum_{a\leq b} O_{ab}(e) log(\hat{P}_{ab}) + (n_{ab}(e) - O_{ab}(e)) log(1 - \hat{P}_{ab}) \\ \qquad \qquad \; \; \; = \frac{1}{n^2} \sum_{a\leq b} n_{ab}(e) \left[ \frac{O_{ab}(e)}{n_{ab}(e)} log(\frac{O_{ab}(e)}{n_{ab}(e)}) + (1 - \frac{O_{ab}(e)}{n_{ab}(e)}) log(1 - \frac{O_{ab}(e)}{n_{ab}(e)}) \right] \\ \qquad \qquad \; \; \; = \frac{1}{n^2} \sum_{a\leq b} n_{ab}(e) \tau(\frac{O_{ab}(e)}{n_{ab}(e)})$

Where $\tau(x) = x\log(x) + (1-x) \log(1-x)$.
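Using the same counts as for $Q_B$, this criterion can be sketched with the standard library only ($\tau$ is extended by continuity with $\tau(0) = \tau(1) = 0$; the toy counts come from the two-triangles-plus-bridge graph and are our illustrative choice):

```python
from math import log

def tau(x):
    """tau(x) = x log x + (1 - x) log(1 - x), with tau(0) = tau(1) = 0."""
    return sum(p * log(p) for p in (x, 1.0 - x) if p > 0)

def likelihood_modularity(O, n_ab, n):
    """Q_ML(e) = n^{-2} sum_{a<=b} n_ab(e) * tau(O_ab(e) / n_ab(e))."""
    K = len(O)
    q = sum(n_ab[a][b] * tau(O[a][b] / n_ab[a][b])
            for a in range(K) for b in range(a, K) if n_ab[a][b] > 0)
    return q / n ** 2

# two triangles + bridge again: the natural split beats a mixed labelling
split = likelihood_modularity([[3, 1], [0, 3]], [[3, 9], [0, 3]], 6)
mixed = likelihood_modularity([[1, 5], [0, 1]], [[3, 9], [0, 3]], 6)
```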

This criterion is very similar to the first term of the Bayesian modularity: instead of integrating out the parameters $P_{ab}$, we replace them with their most likely values with regards to the graph structure and the current labelling.

The following lemma provides a numerical comparison between the two modularities.


Lemma 2.1 : There exists a constant $C$ such that, for $\mathcal{E} = \{1, ..., K\}^n$ the set of all possible labellings

$$ max_{e\in\mathcal{E}} |Q_B(e) - Q_{ML}(e) - Q_P(e)| \leq \frac{C log n}{n^2} \qquad \text{for} \qquad Q_P(e) = \frac{1}{n^2} \sum_{a : n_a + \lfloor \alpha \rfloor \geq 2} n_a(e) log(n_a(e)) - \frac{1}{n} $$

As a result we have $ max_{e\in\mathcal{E}} |Q_B(e) - Q_{ML}(e)| = \mathcal{O}(\frac{log n}{n}) $.


We implemented the Likelihood modularity optimisation on the same example of the Karate Club network to compare its results with the Bayesian modularity approach.

Here we are quite surprised to find that the classification obtained with the Likelihood modularity does not coincide at all with the results of the Bayesian modularity.

The criterion still isolates the club leaders and their supporters in the same group, but divides the rest of the nodes in a fashion that is very different from the club separation.

We check that this result is not due to a limitation of the Tabu Search ability to explore the set of solutions.

Looking at the previous results, the club separation does not seem to achieve a better Likelihood modularity than the solution returned by the Tabu Search. The node classification that we obtained makes little sense with regards to what we know about the clubs and the result of the Bayesian modularity.

We have however only applied the MAP estimation to a specific and rather small network. The following section introduces the theorems that support the consistency of that estimator.

2.4 - Consistency of the MAP classifier

a) Identifiability and consistency

A classification $\hat{e}$ is said to be weakly consistent if the fraction of misclassified nodes tends to zero, and strongly consistent if the probability of misclassifying any of the nodes tends to zero.

Because the way we enumerate the communities is arbitrary, consistency is defined up to a permutation of the labels. So if we let $\sigma$ be a permutation of $\{1, ..., K\}$ and $P_\sigma$ its matrix representation, we look for the existence of a permutation $\sigma$ that provides weak or strong consistency.

To formalise these notions, we introduce:

$$ R_{ab}(e,c) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{e_i=a, c_i=b\} \qquad \text{for} \quad e, c \in \{1, ..., K\}^n $$

$$ f_a(e) = \frac{n_a(e)}{n} \qquad \text{for} \quad e \in \{1, ..., K\}^n $$

Now to express consistency, we use the following lemma:


Lemma 2.2 : $ \forall e, c \in \{1, ..., K\}^n \qquad \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{ c_i \neq e_i\} = \frac{1}{2} || Diag(f(c)) - R(e,c)||_1$

Proof : We have $ \sum_{a=1}^K R(e,c)_{aa} = \sum_{a=1}^K \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{e_i=a, c_i=a\} = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{e_i=c_i\} \qquad \text{so} \qquad \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{ c_i \neq e_i\} = 1 - \sum_{a=1}^K R(e,c)_{aa} $.

Now, $ \sum_{a, b} R(e,c)_{ab} = 1 \qquad \text{so} \qquad 1 - \sum_{a=1}^K R(e,c)_{aa} = \sum_{a\neq b} R(e,c)_{ab}$.

Moreover, since $\sum_{a=1}^K f_a(c) = 1$, we have $ 1 - \sum_{a=1}^K R(e,c)_{aa} = \sum_{a=1}^K \left( f_a(c) - R(e,c)_{aa} \right) = \sum_{a\neq b} R(e,c)_{ab}$.

Finally $ || Diag(f(c)) - R(e,c)||_1 = \sum_{a=1}^K \left( f_a(c) - R(e,c)_{aa} \right) + \sum_{a\neq b} R(e,c)_{ab} = 2 \sum_{a=1}^K \left( f_a(c) - R(e,c)_{aa} \right)$.

Gathering all this, we get $ \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{ c_i \neq e_i\} = \sum_{a=1}^K \left( f_a(c) - R(e,c)_{aa} \right) = \frac{1}{2} || Diag(f(c)) - R(e,c)||_1 $.
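Lemma 2.2 can be checked numerically on a small example (a sketch using exact rational arithmetic from the standard library; the labellings are our illustrative choice):

```python
from fractions import Fraction

def misclassified_fraction(e, c):
    """Left-hand side: (1/n) #{i : c_i != e_i}."""
    return Fraction(sum(1 for x, y in zip(e, c) if x != y), len(e))

def half_l1_distance(e, c, K):
    """Right-hand side: (1/2) || Diag(f(c)) - R(e, c) ||_1."""
    n = len(e)
    R = [[Fraction(sum(1 for i in range(n) if e[i] == a and c[i] == b), n)
          for b in range(K)] for a in range(K)]
    f = [Fraction(sum(1 for x in c if x == a), n) for a in range(K)]
    total = sum(abs((f[a] if a == b else 0) - R[a][b])
                for a in range(K) for b in range(K))
    return total / 2

e = (0, 1, 1, 0, 2)
c = (0, 1, 2, 0, 2)
print(misclassified_fraction(e, c) == half_l1_distance(e, c, 3))  # → True
```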


With this lemma, we can write that $\hat{e}$ is weakly consistent if $ \exists \sigma \quad \text{s.t.} \quad ||P_\sigma R(\hat{e}, Z) - Diag(f(Z))||_1 \longrightarrow 0$.

Similarly, $\hat{e}$ is strongly consistent if $ \exists \sigma \quad \text{s.t.} \quad \mathbb{P}(P_\sigma R(\hat{e}, Z) = Diag(f(Z))) \longrightarrow 1$.

A necessary requirement for consistency is that the classes can be recovered from the likelihood, which means that the model parameters must be identifiable. If every element of $\pi$ is strictly positive, this requires that $P$ have no identical rows; in this case $(\pi, P)$ is identifiable. In the case where some elements of $\pi$ are null, only the part of $P$ that does not involve those empty communities can be identifiable. In this case, we say that $(\pi, P)$ is identifiable if, after removing from $P$ the rows and columns associated with the empty communities, the reduced couple is identifiable.

b) Weak consistency

In this section, we introduce the central theorem of the article that states the conditions for the consistency of the MAP classifier $\hat{e} = argmax_e Q_B(e)$.


Theorem 2.1 : This theorem states the conditions of weak consistency under the non-sparse and the sparse setups.

Proof :

  1. We will start by showing the proximity of $\hat{e}$, the MAP classifier, to $Z$, the random variable of the node labellings, with regards to another statistic, before applying that to the two cases presented in the theorem.

We introduce :

We have

$ Q_{ML}(e) = \frac{1}{n^2} \sum_{1 \leq a \leq b \leq K} n_{ab}(e) \tau(\frac{O_{ab}(e)}{n_{ab}(e)})$

$ \qquad \quad \, = \frac{1}{n^2} \left[ \sum_{a=1}^K \frac{n_a(e)(n_a(e)-1)}{2} \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)(n_a(e)-1)}) + \sum_{a<b} n_a(e) n_b(e) \tau(\frac{\tilde{O}_{ab}(e)}{n_a(e)n_b(e)}) \right]$

$ \qquad \quad \, = \frac{1}{2 n^2} \left[ \sum_{a=1}^K n_a(e)(n_a(e)-1) \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)(n_a(e)-1)}) + \sum_{a\neq b} n_a(e) n_b(e) \tau(\frac{\tilde{O}_{ab}(e)}{n_a(e)n_b(e)}) \right] $

$ \Longrightarrow | Q_{ML}(e) - \mathbb{L}(e) | \leq \frac{1}{2 n^2} \left[ \sum_{a=1}^K | n_a(e)(n_a(e)-1) \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)(n_a(e)-1)}) - n_a(e)^2 \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)^2}) | \right]$

Now using lemma 2.3,

$ | Q_{ML}(e) - \mathbb{L}(e) | \leq \frac{1}{2 n^2} \left[ \sum_{a=1}^K n_a(e)(n_a(e)-1) | \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)(n_a(e)-1)}) - \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)^2}) | + | n_a(e)(n_a(e)-1) - n_a(e)^2 | | \tau(\frac{\tilde{O}_{aa}(e)}{n_a(e)^2}) | \right] $

$ \qquad \qquad \qquad \; \leq \frac{1}{2 n^2} \left[ \sum_{a=1}^K n_a(e)(n_a(e)-1) l(\frac{\tilde{O}_{aa}(e)}{n_a(e)^2(n_a(e)-1)}) + n_a(e) \| \tau \|_\infty \right] $

$ \qquad \qquad \qquad \; \leq \frac{1}{2 n^2} \left[ \sum_{a=1}^K n_a(e)^2 l(\frac{\tilde{O}_{aa}(e)}{n_a(e)^2(n_a(e)-1)}) \right] + \frac{\| \tau \|_\infty}{2 n} $

$ \qquad \qquad \qquad \; \leq \frac{1}{2 n^2} \left[ \sum_{a=1}^K n_a(e) log(n) \right] + \frac{\| \tau \|_\infty}{2 n} \qquad \text{because} \quad n_a l(\frac{u}{n_a}) \lesssim log(n_a) \leq log(n) \quad \text{for} \quad u \in [0,1] $

Which means that $ | Q_{ML}(e) - \mathbb{L}(e) | \leq \frac{log n}{2 n} + \frac{\| \tau \|_\infty}{2 n} = \mathcal{O}(\frac{log n}{n}) $.

Thanks to lemma 2.1, we also know that the Bayesian modularity is equivalent to the Likelihood modularity up to order $\frac{log n}{n}$.

Using that, we finally have $ \eta_{n,1} = max_e | Q_B(e) - \mathbb{L}(e) | = \mathcal{O}(\frac{log n}{n})$.

And, since $Q_B(\hat{e}) \geq Q_B(Z)$ almost surely by definition of the MAP, we have $\mathbb{L}(Z) - \mathbb{L}(\hat{e}) \leq 2 \eta_{n,1}$.

We now wish to replace $\tilde{O}$ by its asymptotic equivalent in $\mathbb{L}$.

To do so, we use Lemma 2.4 for $x = x_n = M \frac{\|P\|_\infty^{1/2} \lor \sqrt{n}}{\sqrt{n}}$ with $M\gg1$ large enough. In this case we have the right side of the inequality tending to 0 when $n$ tends to $+ \infty$. Thus for all $\epsilon>0$, there exists $M>0$ and $n_0>0$ such that

$$ \mathbb{P}(max_e \|\tilde{O}(e) - \mathbb{E}(\tilde{O}(e)|Z)\|_\infty > x_n n^2) \leq 2 K^{n+2} e^{-\frac{x_n^2 n^2}{8\|P\|_\infty + 4x_n/3}} < \epsilon $$

This is the definition of order in probability and $ \frac{max_e \|\tilde{O}(e) - \mathbb{E}(\tilde{O}(e)|Z)\|_\infty}{n^2} = \mathcal{O}_\mathbb{P}(\frac{\|P\|_\infty^{1/2} \lor \sqrt{n}}{\sqrt{n}}) $.

Now, to obtain an explicit approximation of $\mathbb{E}(\tilde{O}(e) | Z)$, we use Lemma 2.5, for $R(e,Z) \in \mathbb{R}^{K\times K}$ with $ R_{ab}(e,Z) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(e_i=a, Z_i=b)$

$$ \mathbb{E}(\tilde{O}(e) | Z) = n^2R(e,Z)PR(e,Z)^T - n Diag(R(e,Z) diagP) $$

Which gives $ max_e \| \frac{1}{n^2} \mathbb{E}(\tilde{O}(e) | Z) - R(e,Z)PR(e,Z)^T \|_\infty = max_e \| \frac{1}{n} Diag(R(e,Z) diagP) \|_\infty$.

Moreover $ \left( R(e,Z) diagP \right)_a = \frac{1}{n} \sum_b P_{bb} \sum_i \mathbb{1}(e_i=a, Z_i=b) \leq \frac{1}{n} \sum_b P_{bb} n_b(Z) \leq \frac{1}{n} \sum_b n_b(Z) = 1 $.

So $ max_e \| \frac{1}{n} Diag(R(e,Z) diagP) \|_\infty \leq \frac{1}{n} \longrightarrow 0$ and $ max_e \| \frac{1}{n^2} \mathbb{E}(\tilde{O}(e) | Z) - R(e,Z)PR(e,Z)^T \|_\infty \longrightarrow 0 $.

Now we can introduce $L$ the asymptotic equivalent of $\mathbb{L}$ : $ L(e) = \frac{1}{2} \sum_{a,b} f_a(e) f_b(e) \tau \left( \frac{(R(e,Z)PR(e,Z)^T)_{ab}}{f_a(e)f_b(e)}\right)$

Using again Lemma 2.3, we can quantify the difference between $\mathbb{L}$ and its asymptotic equivalent,

$ \qquad max_e |\mathbb{L}(e) - L(e)| = max _e \frac{1}{2} | \sum_{a,b} f_a(e)f_b(e) \tau\left( \frac{n^{-2} \tilde{O}_{ab}(e)}{f_a(e)f_b(e)}\right) - \sum_{a,b} f_a(e) f_b(e) \tau \left( \frac{(R(e,Z)PR(e,Z)^T)_{ab}}{f_a(e)f_b(e)}\right) | \\ \qquad \qquad \qquad \leq max _e \frac{1}{2} \sum_{a,b} | f_a(e)f_b(e) \tau\left( \frac{n^{-2} \tilde{O}_{ab}(e)}{f_a(e)f_b(e)}\right) - f_a(e) f_b(e) \tau \left( \frac{(R(e,Z)PR(e,Z)^T)_{ab}}{f_a(e)f_b(e)}\right) | \\ \qquad \qquad \qquad \leq max _e \frac{1}{2} \sum_{a,b} l\left( | n^{-2} \tilde{O}_{ab}(e) - (R(e,Z)PR(e,Z)^T)_{ab} | \right) \\ \qquad \qquad \qquad \leq max _e \frac{1}{2} \sum_{a,b} l\left( \| n^{-2} \tilde{O}(e) - R(e,Z)PR(e,Z)^T \|_\infty \right) \qquad \text{because l is increasing}$

And using the previous results : $\| n^{-2} \tilde{O}(e) - R(e,Z)PR(e,Z)^T \|_\infty = \| n^{-2} \tilde{O}(e) - n^{-2} \mathbb{E}(\tilde{O}(e)|Z) + \mathbb{E}(\tilde{O}(e)|Z) - R(e,Z)PR(e,Z)^T \|_\infty = \mathcal{O}_\mathbb{P}(\frac{\|P\|_\infty^{1/2} \lor \sqrt{n}}{\sqrt{n}}) $

Which means that $ \eta_{n,2} = max_e |\mathbb{L}(e) - L(e)| = \mathcal{O}_\mathbb{P}(l(\frac{\|P\|_\infty^{1/2} \lor \sqrt{n}}{\sqrt{n}})) $.

We deduce that $ L(Z) - L(\hat{e}) \leq L(Z) - \mathbb{L}(Z) + \mathbb{L}(\hat{e}) - L(\hat{e}) + \mathbb{L}(Z) - \mathbb{L}(\hat{e}) \leq 2(\eta_{n,1} + \eta_{n,2})$

  2. We consider the first setup of the theorem, where $(P, \pi)$ is fixed and identifiable.

Let $ R_\delta = \{ R \; \text{probability matrix such that} \; min_{P_\sigma} \| P_\sigma R - Diag(R^T \mathbb{1})\|_1 \geq \delta \; \text{and} \; min_{a : \pi_a>0} (R^T\mathbb{1})_a \geq \delta \}$ for a given $\delta >0$.

We introduce $H_P(R) = \frac{1}{2} \sum_{a,b} (R\mathbb{1})_a (R\mathbb{1})_b \tau\left( \frac{(RPR^T)_{ab}}{(R\mathbb{1})_a (R\mathbb{1})_b} \right)$ for $R \in \mathbb{R}^{K\times K}$.

This function is such that $L(e) = H_P(R(e,Z))$ and $L(Z) = H_P(R(Z,Z)) = H_P(Diag(f(Z))) = H_P(Diag(R(e,Z)^T \mathbb{1}))$.

Now we can make the connection between this and $R_\delta$ and set $\eta = \eta_n = inf_{R \in R_\delta} \left( H_P(Diag(R^T\mathbb{1})) - H_P(R) \right)$.

Because $R_\delta$ is a subset of the set of probability matrices, it is bounded, and because it is the preimage of a closed set under a continuous function, it is also closed. Thus $R_\delta$ is compact. The function $R \mapsto H_P(Diag(R^T\mathbb{1})) - H_P(R)$ is continuous, which implies that $\eta$ is well defined and that there exists $R \in R_\delta$ attaining this infimum.

By definition of $R_\delta$, no matrix $R$ in $R_\delta$ can be transformed into a diagonal matrix by permuting its rows, and it necessarily has non-zero elements in every column $a$ where $\pi_a>0$. We can thus use Lemma 2.6 : $ \forall R \in R_\delta, \quad H_P(R) < H_P(Diag(R^T \mathbb{1}))$, which implies that $\eta_n > 0$.

Using the result in part 1. of the proof and what we just showed, if $R(\hat{e},Z) \in R_\delta$

$$ 0 < \eta_n \leq H_P(Diag(R(\hat{e},Z)^T \mathbb{1})) - H_P(R(\hat{e},Z)) = L(Z) - L(\hat{e}) \leq 2(\eta_{n,1} + \eta_{n,2}) $$

Because here, $\eta_n = \eta $ does not depend on n and is strictly positive, there exists $n_0>0$ such that $\forall n \geq n_0 \quad 2(\eta_{n,1} + \eta_{n,2}) < \eta_n$ and in this case $R(\hat{e},Z) \notin R_\delta$.

By the law of large numbers, $R(\hat{e},Z)^T \mathbb{1} \longrightarrow \pi$, so, for $n$ large enough and $\delta$ small enough, $R(\hat{e},Z)$ satisfies the second condition defining $R_\delta$. So, if $R(\hat{e},Z) \notin R_\delta$, there must exist a permutation matrix $P_\sigma$ such that $\| P_\sigma R(\hat{e},Z) - Diag(R(\hat{e},Z)^T \mathbb{1})\|_1 \leq \delta$. If this inequality holds for a small enough $\delta$, then it also holds for any larger $\delta$.

We have thus proven that $\forall \delta>0, \quad \exists n_1>0 \; \text{and} \; P_\sigma : \forall n\geq n_1, \quad \| P_\sigma R(\hat{e},Z) - Diag(R(\hat{e},Z)^T \mathbb{1})\|_1 \leq \delta $.

This means that $ min_{P_\sigma} \| P_\sigma R(\hat{e},Z) - Diag(R(\hat{e},Z)^T \mathbb{1})\|_1 \overset{\mathbb{P}}{\longrightarrow} 0 $, which proves the weak consistency of $\hat{e}$.

  3. We now consider the second setup, where $P=\rho_n S$.

Using Lemma 2.7, we have for a given sequence $\rho_n \longrightarrow 0$, $\frac{1}{\rho_n} \left( H_{\rho_n S}(Diag(R^T \mathbb{1})) - H_{\rho_n S}(R) \right) \longrightarrow G_S(Diag(R^T\mathbb{1})) - G_S(R) > 0$.

This time $\eta_n = inf_{R \in R_\delta} \left( H_P(Diag(R^T\mathbb{1})) - H_P(R) \right)$ depends on $n$. It is however bounded below by $\rho_n$ times a strictly positive number that depends only on $(S, \pi)$.

Thus, if we have $\rho_n \gg \eta_{n,1} + \eta_{n,2}$, for $n$ large enough, we can use the same reasoning as we did in the previous part for $2(\eta_{n,1} + \eta_{n,2}) < \eta_n$ and we have the weak consistency of $\hat{e}$.

And $\rho_n \gg \eta_{n,1} + \eta_{n,2} \Leftrightarrow \rho_n \gg \frac{\log n}{n} + \frac{\log(\sqrt{n/\rho_n})}{\sqrt{n/\rho_n}} \Leftrightarrow n\rho_n \gg \log^2(\frac{n}{\rho_n})$


The following are the lemmas used in the previous proof.


Lemma 2.3 : The function $\tau : [0,1] \longrightarrow \mathbb{R} $ satisfies

$$ \forall x,y,v \in [0,1], \qquad |v \tau(\frac{x}{v}) - v \tau(\frac{y}{v}) | \leq l(|x-y|) $$

With $ l(x) = 2x(1\lor log(\frac{1}{x}))$


Lemma 2.4 : For $\tilde{O}$ as previously defined and for any $ x>0 $

$$ \mathbb{P}(max_e \|\tilde{O}(e) - \mathbb{E}(\tilde{O}(e)|Z)\|_\infty > x n^2) \leq 2 K^{n+2} e^{-\frac{x^2 n^2}{8\|P\|_\infty + 4x/3}} $$

Lemma 2.5 : For $\tilde{O}$ and $R(e,Z)$ as previously defined,

$$ \mathbb{E}(\tilde{O}_{ab}(e) | Z) = \left( n^2 R(e,Z)PR(e,Z)^T - n \, Diag(R(e,Z) \, diagP) \right)_{ab} $$

Lemma 2.6 : For any probability matrix $R$

$$ H_P(R) \leq H_P(Diag(R^T\mathbb{1})) $$

Furthermore, if $(P,\pi)$ is identifiable and the columns of $R$ corresponding to positive coordinates of $\pi$ are not identically zero, then the inequality is strict unless there exists some permutation $\sigma$ such that $ P_\sigma R$ is a diagonal matrix.


Lemma 2.7 : For any matrix $P \in \mathbb{R}^{K\times K}$ fixed with its elements in $[0,1]$ and R a probability matrix, for $\rho_n \longrightarrow 0$

$$ \frac{1}{\rho_n} \left( H_{\rho_n P}(Diag(R^T \mathbb{1})) - H_{\rho_n P}(R) \right) \longrightarrow G_P(Diag(R^T\mathbb{1})) - G_P(R) $$

Furthermore, if $(P,\pi)$ is identifiable and the columns of $R$ corresponding to positive coordinates of $\pi$ are not identically zero, then the right-hand side of this limit is strictly positive unless there exists some permutation $\sigma$ such that $ P_\sigma R$ is a diagonal matrix.


c) Strong consistency

The article also provides conditions for the strong consistency of the MAP classifier.


Theorem 2.2 : This theorem states the conditions of strong consistency under the non-sparse and the sparse setups.


The proof of this theorem relies mainly on the weak consistency of the MAP classifier and some additional lemmas.

d) Illustration on an example

In this section, we wanted to test the results of these theorems and illustrate them with a given example $(P, \pi)$.

To do so, we constructed graphs with the same $(P, \pi)$ and an increasing number of nodes. We wanted to track the evolution of the proportion of misclassified nodes.

The two functions below are responsible for initialising a fixed and identifiable couple $(P, \pi)$ and for building a corresponding graph, given the desired number of nodes and communities.
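Since the notebook functions themselves are not reproduced here, a minimal sketch of what they could look like is given below. The names `init_model` and `sample_graph`, and the particular values of $(P, \pi)$, are our own illustrative choices, not the exact code used in the report:

```python
import numpy as np

def init_model():
    """Return a fixed, identifiable couple (P, pi) for K = 2 communities."""
    P = np.array([[0.8, 0.2],
                  [0.2, 0.6]])        # distinct rows make (P, pi) identifiable
    pi = np.array([0.5, 0.5])
    return P, pi

def sample_graph(n, P, pi, seed=None):
    """Draw labels Z and an undirected SBM adjacency matrix A on n nodes."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    Z = rng.choice(K, size=n, p=pi)   # latent community of each node
    probs = P[Z][:, Z]                # P_{Z_i, Z_j} for every pair (i, j)
    A = (rng.random((n, n)) < probs).astype(int)
    A = np.triu(A, 1)                 # keep only i < j: no self-loops
    return Z, A + A.T                 # symmetrise for an undirected graph
```

Sampling the full $n \times n$ uniform matrix is wasteful for large graphs but keeps the sketch short and vectorised.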

To estimate the proportion of misclassified nodes as a function of the number of nodes in the graph, we considered the following algorithm :
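The algorithm itself is not listed here, but its structure can be sketched as follows. The helper `misclassification_rate` matches labellings up to a permutation of the labels; `run_experiment` takes the graph sampler and the label-fitting routine as arguments (the latter standing in for the Tabu-Search maximisation of the Bayesian modularity, which is not shown):

```python
import numpy as np
from itertools import permutations

def misclassification_rate(Z, e, K):
    """Proportion of misclassified nodes, minimised over label permutations,
    since community labels are only identified up to relabelling."""
    n = len(Z)
    best = n
    for sigma in permutations(range(K)):
        relabelled = np.array([sigma[c] for c in e])
        best = min(best, int(np.sum(relabelled != np.asarray(Z))))
    return best / n

def run_experiment(sizes, sample, fit, K=2):
    """For each number of nodes n, draw a graph with the fixed (P, pi),
    fit labels, and record the misclassification rate."""
    rates = []
    for n in sizes:
        Z, A = sample(n)      # e.g. the SBM sampler with fixed (P, pi)
        e = fit(A, K)         # stands in for the Tabu-Search MAP estimate
        rates.append(misclassification_rate(Z, e, K))
    return rates
```

The brute-force minimisation over the $K!$ permutations is fine for the small $K$ considered here.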

We launched the algorithm on a rather small maximum number of nodes and for only two communities. This illustration is thus really limited but it was not reasonable, computation-wise, to consider anything greater than that.

We now plot the misclassification rates along with the couple $(P, \pi)$ chosen for the graphs.

As we can see, we do observe a decreasing trend, but with only 100 nodes we are far from observing any convergence to 0.

Because the Tabu Search is really computationally expensive, we could not compute the misclassification rates for more nodes. This is a big limitation of this meta-heuristic, and to truly illustrate the results of the theorems we would need to consider a more efficient optimisation process.



We saw in this section the approach presented in the main article. We illustrated it with a simple example and proved its consistency under certain conditions.

However, a big limitation of this approach remains the need to know in advance the right number of communities to look for. The following section will present several approaches in the literature that propose to also estimate K, and discuss their feasibility with regard to Bayesian modularity maximisation.



3 - Number of communities

The stochastic block model (SBM) is one of the best-studied network models for community structure. A wide variety of methods have been proposed to estimate the latent community membership of nodes in an SBM. However, most of these works assume that the number of communities k is known a priori. In a real-world network, k is usually unknown and needs to be estimated. Therefore, it is important to investigate how to choose k.
Because of the difficulties in finding a scientific paper relating to the method studied in part 2, we will study two different approaches: a statistical one and a dynamic one.

3.1 - A two-stage approach

The two-stage maximum-likelihood approach first maps the vertices into the latent space and then uses a mixture model to cluster the resulting positions. Several methods have been proposed to estimate the number of communities. Here, we focus on penalisation of the likelihood, in the spirit of the well-known BIC and AIC criteria, for the SBM.

For any fixed (P, z), the log-likelihood of the adjacency matrix A under the stochastic block model is $$ \log f(A|P_{ab} , z) = \sum_{1 \le a \le b \le k} ( O_{ab}(z) \log P_{ab} + ( n_{ab}(z) - O_{ab}(z) ) \log (1 - P_{ab} )) $$
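As a concrete illustration, the profile version of this log-likelihood (each $P_{ab}$ replaced by its maximum-likelihood estimate $O_{ab}(z)/n_{ab}(z)$) can be computed directly from the block counts. The sketch below, with function names of our own choosing, assumes an undirected graph without self-loops:

```python
import numpy as np

def block_counts(A, z, K):
    """Edge counts O_ab and maximal edge counts n_ab per block a <= b."""
    n = len(z)
    sizes = np.bincount(z, minlength=K)
    O = np.zeros((K, K))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sorted((z[i], z[j]))
            O[a, b] += A[i, j]
    n_ab = np.outer(sizes, sizes).astype(float)       # off-diagonal: n_a * n_b
    np.fill_diagonal(n_ab, sizes * (sizes - 1) / 2)   # diagonal: n_a(n_a-1)/2
    return O, np.triu(n_ab)

def log_likelihood(A, z, K, eps=1e-10):
    """Profile log-likelihood: each P_ab replaced by its MLE O_ab / n_ab."""
    O, n_ab = block_counts(A, z, K)
    ll = 0.0
    for a in range(K):
        for b in range(a, K):
            if n_ab[a, b] > 0:
                p = np.clip(O[a, b] / n_ab[a, b], eps, 1 - eps)
                ll += O[a, b] * np.log(p) + (n_ab[a, b] - O[a, b]) * np.log(1 - p)
    return ll
```

The clipping by `eps` only guards against $\log 0$ when a block is empty or complete.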

Wang and Bickel (2017) [25] proposed a penalised likelihood method with penalty function $ \lambda \frac{k(k+1)}{2}n \log n $, where $\lambda$ is a tuning parameter. An alternative penalty function $ \frac{k(k+1)}{2} \log n $ (called the "BIC") was used to select the number of communities in Saldaña, Yu and Feng (2015) [23]. According to Hu, Qin, Yan and Zhao [24], these two estimators tend respectively to underestimate and overestimate the number of communities. They therefore propose a "corrected Bayesian information criterion" (CBIC) that sits midway between those two criteria. Specifically, the penalty function is $$ \lambda n \log k + \frac{k(k+1)}{2}\log n $$

Let Z be the set of all possible community assignments under consideration and let $\xi (z) $ be a prior probability of community assignment z. Assume that the prior density of $P_{ab}$ is given by $\Pi (P_{ab})$. Then the posterior probability of z is $$ P(z|A ) = \frac{g(A|z) \xi(z)}{\sum_{z \in Z} g(A|z) \xi(z)}$$

where g(A|z) is the likelihood of community assignment z, given by $$ g(A|z) = \int f(A| P_{ab}, z) \Pi(P_{ab}) dP_{ab}$$

Under the Bayesian paradigm, a community assignment $\hat z = arg max_{z \in [k]^n} g(A|z) \xi(z) $ that maximises the posterior probability is selected (the denominator $\sum_{z \in Z } g(A|z) \xi(z) $ is constant in $z$).

They assume that Z is partitioned into $\cup_{k} Z_k$. Let $\tau (Z_k)$ be the size of $Z_k$. An equal probability is assigned to each z in the same $Z_k$, i.e., $P(z|Z_k) = \frac{1}{\tau (Z_k)}$ for any $z \in Z_k$. Next, they set $P(Z_k)$ proportional to $\tau^{-\delta}(Z_k)$ for some $\delta$, where $\delta > 0$ implies that a small number of communities is plausible while $\delta < 0$ implies that a large number of communities is plausible. This results in the prior probability $$ \xi (z) = P(z|Z_k)P(Z_k ) \propto \tau^{-\lambda} (Z_k) \text{, $z \in Z_k$} $$ where $\lambda = 1 + \delta $.
This type of prior distribution on the community assignment suggests a corrected BIC criterion (CBIC) as follows: $$ l(k) = max_{z \in [k]^n} {sup}_{P_{ab} \in \Theta_k} \log f(A| P_{ab}, z) - [\lambda n log k + \frac{k(k+1)}{2} \log n]$$

where the second term is the penalty and λ ≥ 0 is a tuning parameter. Then we estimate k by maximizing the penalized likelihood function: $$ \hat k = arg max_k l(k) $$
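Given the profile log-likelihood for each candidate $k$, the selection step itself is a one-liner; a minimal sketch follows. Note that computing $max_{z} \, sup_{P_{ab}} \log f(A|P_{ab},z)$ exactly is itself a hard combinatorial problem, so in practice it would be approximated (e.g. variationally); here it is simply taken as an input mapping:

```python
import numpy as np

def cbic_select(profile_ll, n, lam=1.0):
    """Select k maximising l(k) = loglik(k) - [lam*n*log(k) + k(k+1)/2 * log(n)].
    `profile_ll` maps each candidate k to (an approximation of)
    max_z sup_P log f(A | P, z)."""
    def score(k):
        penalty = lam * n * np.log(k) + k * (k + 1) / 2 * np.log(n)
        return profile_ll[k] - penalty
    return max(profile_ll, key=score)
```

With a flat likelihood profile the penalty dominates and the smallest candidate $k$ wins, which is exactly the behaviour the penalty is designed to enforce.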

Asymptotics of the log-likelihood ratio

In this section, we present the order of the log-likelihood ratio, building on the work of Wang and Bickel (2017) [25] : $$ L_{k,k'} = max_{z \in [k']^n} sup_{P_{ab} \in \Theta_{k'}} \log f(A|P_{ab}, z) - \log f(A|P_{ab}^*, z^*) $$ where $P_{ab}^*$ and $z^*$ are the true parameters. Further, $k'$ is the number of communities under the alternative model and $k$ is the true number of communities. We are interested in the asymptotic distribution of $L_{k,k'}$.

In the article Corrected Bayesian information criterion for stochastic block models [24], the authors split the analysis into the cases $k'<k$, $k'=k$ and $k'>k$. We do not prove all the results of this paper, but we give the conditions required to obtain the asymptotic distribution of the log-likelihood ratio.

To obtain the asymptotic distribution of $L_{k,k′}$ , we need the following conditions :

Theorem : Asymptotics of the log-likelihood ratio Suppose that $A \sim P_{\theta ^* , z^*}$, under (A1) - (A3), and $\rho_n =1$.

where 
$$ \mu = \frac{1}{n^2}\left(\sum_{1 \le a \le b \le k'} n'_{ab} \left( P'_{ab} \log \frac{P'_{ab}}{ 1 - P'_{ab} } + \log( 1- P'_{ab})\right) - \sum_{1 \le a \le b \le k} n_{ab} \left( P^*_{ab} \log \frac{P^*_{ab}}{ 1 - P^*_{ab} } + \log( 1- P^*_{ab})\right)\right) $$
$$ \sigma^2 = \frac{1}{n^2}\left(\sum_{1 \le a \le b \le k'} n'_{ab} P'_{ab} (1 - P'_{ab} ) \left(\log \frac{P'_{ab}}{ 1 - P'_{ab} }\right)^2 + \sum_{1 \le a \le b \le k} n_{ab} P^*_{ab} (1 - P^*_{ab} ) \left(\log \frac{P^*_{ab}}{ 1 - P^*_{ab} }\right)^2 \right) $$

Consistency of the CBIC

We want to know whether the CBIC chooses the correct $k$ with probability tending to one as $n$ goes to infinity. For that, we need an additional assumption: $$ \text{(A4) : } \frac{\eta \mu}{\log k} \rightarrow - \infty \text{, for } k' < k $$

Theorem (Consistency of the CBIC) : Suppose that $A \sim P_{P^*_{ab}, z^*} $, that (A1) - (A4) hold, and that $ \rho_n =1 $. Let $l(k)$ be the penalised likelihood function for the CBIC. If $k = o((\frac{n}{\log n})^{1/2})$,
for $k' < k$, we have $$ P(l(k') > l(k) ) \rightarrow 0 $$ for $k' > k$, when $\lambda > \frac{\alpha \log k'} {\log k' - \log k }$, we have $$ P(l(k') > l(k)) \rightarrow 0 $$ where $\alpha$ is a constant given in [24].

By this theorem, the probability $P(l(k') > l(k))$ goes to zero regardless of the value of the tuning parameter $\lambda$ in the case $k' < k$. When $k' > k$, it depends on the parameter $\lambda$.

The proof of this theorem is detailed in [24]. The corrected Bayesian information criterion thus leads to a consistent estimator of the number of communities in the SBM.

Other approaches

Another very popular method for community detection is the variational Expectation-Maximisation (EM) algorithm, Xu et al. (2012) [26]. It is an iterative method for finding maximum likelihood estimates in models that depend on unobserved latent variables. It can be very useful for estimating the number of communities, but the criteria are not exactly the same. The EM algorithm requires the knowledge of $p(Z|X,\alpha, \pi)$, which is not tractable (no conditional independence) in this type of model. Since $p(X|\alpha, \pi)$ is not tractable either, we cannot rely on the BIC criterion.
We take this opportunity to introduce the difference between the complete-data likelihood and the observed-data likelihood. The complete-data likelihood takes a cluster-analysis view and favours well-separated clusters, implying some robustness against model misspecification; it is constructed assuming that the value of the latent variable $z_i$ is known (MacKenzie, Hines 2018 [27]). The integrated observed-data likelihood, on the other hand, takes a density-estimation view and is expected to provide a consistent estimation of the distribution of the data, Biernacki et al. (2010) [28].

Back on topic, to tackle the intractability of $p(Z|X,\alpha, \pi)$, Daudin et al. (2008) [29] used a criterion, the so-called ICL, based on an asymptotic approximation of the integrated complete-data likelihood. This criterion relies on the joint distribution $p(X,Z| \alpha, \pi)$ rather than $p(X|\alpha, \pi)$ and can be computed easily, even in the case of the SBM. But, because it relies on an asymptotic approximation, Biernacki et al. (2010) [28] showed, in the case of mixtures of multivariate multinomial distributions, that it may fail to detect interesting structures present in the data for small sample sizes. To tackle this issue, Latouche, Birmelé and Ambroise [30] propose a new criterion called ILvb, based on a non-asymptotic approximation of the marginal likelihood.

In practice, these two-stage procedures converge quickly but lose some information by not estimating the positions and the cluster model at the same time. Moreover, such procedures ignore uncertainty in the first stage and are prone to increased erroneous cluster assignments when there is inherent variability in the number of communities. Conversely, Bayesian algorithms based on Markov chain Monte Carlo, McDaid et al. [31], Geng et al. [32], estimate both the latent positions and the mixture model parameters simultaneously.

3.2 - A modern MCMC technique

As in the paper of Van Der Pas and Van Der Vaart [9], McDaid et al. (2013) [31] adopt the Bayesian approach of Nowicki and Snijders (2001) [33]. But they treat K as a random variable with a given prior distribution.
Given N (the number of nodes) and K (the number of communities), the SBM describes a random process for assigning the nodes to clusters and then generating a network. Specifically, the cluster memberships are represented by a random vector Z of length N, with $Z_i$ taking values in a finite set {1,...,K} according to probabilities $\pi = (\pi_1 , ..., \pi_K)$. The prior on $\pi$ is a Dirichlet : $$\pi \sim Dirichlet (\alpha, ..., \alpha ) $$ This fully describes how the N nodes are assigned to the K clusters. Next we describe how, given this clustering z, the edges are added between the nodes.
A network can be represented as an $N \times N$ adjacency matrix A, where $ A_{ij} $ represents the relation between node i and node j (taking values 1 or 0).

Given $Z = (Z_1,..., Z_n )$ the edges are independently generated as Bernoulli variables with $ P(A_{ij} = 1 | Z ) = P_{Z_i , Z_j } $ for i < j, for a given $K \times K $ symmetric matrix $P = (P_{ab})$. Each of the $P_{ab}$ is drawn from the conjugate Beta($\beta_1, \beta_2$) prior, $$P_{ab} \sim Beta( \beta_1, \beta_2) $$ We assume $\alpha, \beta_1, \beta_2$ > 0.

In summary, given N and K the random process generates $ P, Z, \pi $ and ultimately the network A. The two main variables of interest are the clustering Z and the network A.

We complete the Bayesian model by specifying class labels $e = (e_1, ..., e_n ) $ and edges $A= (A_{ij} ) $ through $$ e_i | \pi, P \sim \pi $$

$$ A_{ij}|\pi, P, e, K \sim Bernoulli ( P_{e_i , e_j}) $$

for $ 1 \le i \le j \le n $.

Like in section 2, we use collapsing to give a more convenient and efficient expression for the model. We treat K as a random variable and place a Poisson prior on K with rate $\lambda = 1$ , conditioning on K > 0, $$ K \sim Poisson(1) | K>0 $$ which gives us : $$ P(K) = \frac{1}{K!(e-1)}$$ We can simply use $ P(K) \propto \frac{1}{K!} $
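The normalising constant $e-1$ comes from $\sum_{K \geq 1} 1/K! = e - 1$; a quick numerical check of this zero-truncated Poisson(1) prior:

```python
import math

def prior_K(K):
    """Zero-truncated Poisson(1) prior: P(K) = 1 / (K! (e - 1)) for K >= 1."""
    return 1.0 / (math.factorial(K) * (math.e - 1))

# The probabilities sum to 1; the tail beyond K = 30 is negligible.
total = sum(prior_K(K) for K in range(1, 31))
```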

This Poisson prior allows the estimation of the number of clusters as an output of the model rather than requiring the user to specify K as an input. Thus, we have a fully Bayesian approach where, other than N, which is taken as given, every quantity is a random variable with a specified prior where necessary, $$ p(A, P, e, \pi, K ) = P(K) \times p(e,\pi|K ) \times p(A, P |e)$$ With this equation we could create an algorithm which, given a network A, would allow us to sample from the posterior $ P, e, \pi, K|A $. However, we are only interested in estimates of $e, K|A$, so we can collapse $P$ and $\pi$ : $$ P(A, e, K ) = P(K) \times \int_\Pi p(e,\pi|K ) d\pi \times \int_\Theta p(A, P |e) dP$$ where $\Pi$ and $\Theta$ denote the domains of $\pi$ and $P$.

This allows the creation of an algorithm which searches only over K and e. The algorithm never needs to concern itself with $\pi$ or $P$. Collapsing greatly simplifies the sample space over which the algorithm has to search.

The integration (as in part 2) of the last equation yields an expression for the full posterior distribution. Let $n_a$ be the number of nodes in cluster a ($n_a$ is a function of e). Let $O_{ab}$ be the number of edges in block ab, i.e. the block between clusters a and b, and let $n_{ab}$ be the maximum number of edges that can be formed between clusters a and b. For diagonal blocks, $n_{aa} = \frac{1}{2} n_a (n_a -1 )$ (undirected, no self-loops).

The full posterior may be written as : $$ P(A,e,K) \propto \frac{1}{K!} \times \frac{\Gamma(\alpha K) \prod_{a} \Gamma(n_a(e) + \alpha ) }{ \Gamma(\alpha)^K \Gamma(n + \alpha K) } \times \prod_{a \le b} \frac{B ( \beta_1 + O_{ab} , n_{ab} - O_{ab} + \beta_2)}{B(\beta_1, \beta_2 ) } $$ where B(.,.) is the Beta function.
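This collapsed posterior is easy to evaluate in log space with log-Gamma functions; a sketch follows (the function name is ours, and an undirected graph without self-loops is assumed):

```python
import math
import numpy as np

def log_posterior(A, e, K, alpha=1.0, b1=1.0, b2=1.0):
    """Unnormalised log P(A, e, K) for the collapsed model."""
    e = np.asarray(e)
    n = len(e)
    sizes = np.bincount(e, minlength=K)
    lbeta = lambda x, y: math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    lp = -math.lgamma(K + 1)                         # prior P(K) proportional to 1/K!
    lp += (math.lgamma(alpha * K) - K * math.lgamma(alpha)
           - math.lgamma(n + alpha * K))             # collapsed Dirichlet term
    lp += sum(math.lgamma(na + alpha) for na in sizes)
    for a in range(K):                               # collapsed Beta term per block
        for b in range(a, K):
            if a == b:
                n_ab = sizes[a] * (sizes[a] - 1) / 2
                O_ab = A[np.ix_(e == a, e == a)].sum() / 2   # each edge counted twice
            else:
                n_ab = sizes[a] * sizes[b]
                O_ab = A[np.ix_(e == a, e == b)].sum()
            lp += lbeta(b1 + O_ab, n_ab - O_ab + b2) - lbeta(b1, b2)
    return lp
```

Only ratios of this quantity between two clusterings matter in the MCMC algorithm described next, so the missing normalising constant is harmless.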

In this approach, McDaid et al. (2013) use a Markov chain Monte Carlo (MCMC) algorithm which samples, given a network A, from the posterior K,e|A. They develop an algorithm that searches across the full sample space of all possible clusterings for all K, drawing from the posterior e,K|A using the posterior P(A,e,K) above as the desired stationary distribution of the Markov chain. They use four moves :

At each iteration, one of these four moves is selected at random. All the moves are essentially Metropolis–Hastings moves; a move to modify e and/or K is generated randomly, proposing a new state (e′, K′), and the ratio of the new density to the old density $\frac{P(e' ,K' |A )}{P(e, K|A )} = \frac{P(A,e',K')}{P(A,e,K)} $ is calculated. This is often easy to calculate quickly since, for certain moves, only a small number of factors are affected by the proposed move. We must also calculate the probability of this particular move being proposed, and of the reverse move being proposed. The proposal probability ratio is combined with the posterior mass ratio to give the move acceptance probability,
$$ min (1, \frac{P(A,e',K')}{P(A,e,K)} \times \frac{P_{prop}((K',e') \rightarrow (K,e))}{P_{prop}((K,e) \rightarrow (K',e'))} )$$ where $P_{prop}((K, e) \rightarrow (K′, e′))$ is the probability that the algorithm, given current state (K, e), will propose a move to (K′, e′).
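In log space, this acceptance step is a one-liner; a generic sketch (the log-posterior values and the forward/reverse proposal log-probabilities are assumed to be supplied by the move in question):

```python
import math
import random

def mh_accept(log_post_new, log_post_old,
              log_prop_reverse, log_prop_forward, rng=random):
    """Metropolis-Hastings acceptance test for a proposed move (e, K) -> (e', K').
    Accepts with probability min(1, posterior ratio * proposal ratio)."""
    log_ratio = (log_post_new - log_post_old) + (log_prop_reverse - log_prop_forward)
    return rng.random() < math.exp(min(0.0, log_ratio))
```

Working with log-ratios avoids the numerical underflow that the raw posterior masses would cause for any non-trivial network.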

With this algorithm they can estimate both the number of clusters and the cluster assignment of each node.

In the article of Sun et al. (2013) [34], the same type of algorithm is used with a modularity objective function. When splitting and merging disciplines in the model, they compare the merged and split partitions and select the option with the higher modularity.

Conclusion

Community detection is a vast topic with many approaches. To better understand it, we took an educational approach, first implementing a spectral clustering algorithm studied in class. For this method, we showed that normalising the Laplacian matrix could improve the results. Alternatively, a modularity point of view can be combined with spectral methods, but we chose to introduce the concept of modularity with the Girvan–Newman algorithm, a very popular algorithm based on betweenness centrality. We then focused on a Bayesian modularity approach.

The Bayesian modularity maximisation approach provides a full posterior distribution of the labelling, thus providing great information on the model and its parameters. Here we only focused on the posterior mode and its consistency but other aspects of that distribution could be studied and used to quantify uncertainty for instance.
After applying that method to the toy example of the Karate club network, we were able to get a better sense of the particularities of this method compared to other approaches. The way it grouped nodes focused heavily on similarity of behaviour, making groups emerge that other methods would have ignored. The fact that it singled out the group leaders is of great interest if one were to look at the hierarchical organisation of graphs.
The use of the Tabu Search as our optimisation method, however, greatly limited our ability to play with bigger datasets. It would have been of interest to consider another, more efficient solver instead and apply the Bayesian modularity approach to other graphs. Moreover, the illustration of the consistency could greatly benefit from such an improvement. Finally, the main drawback of this approach, as presented in the paper, is the absence of a prior on the number of communities considered.

In the last part, we studied one aspect of the discussion of Van Der Pas and Van Der Vaart : the estimation of the number of communities. Because of the difficulty of finding scientific papers on Bayesian approaches based on modularity, we studied two different well-known procedures : a corrected Bayesian information criterion and a Markov chain Monte Carlo algorithm. The first is a two-stage procedure. It converges quickly but loses some information by not estimating the positions and the cluster model at the same time. Moreover, such procedures ignore uncertainty in the first stage and are prone to increased erroneous cluster assignments when there is inherent variability in the number of communities. The second method, the MCMC algorithm, estimates both the latent positions and the mixture model parameters simultaneously. It is sometimes stated that this dynamic approach is slower than other methods, but we did not measure that. It would be very interesting to implement these methods and compare them in terms of both detection ability and computational cost.

References

[1] Zachary WW - 1977 - An Information Flow Model for Conflict and Fission in Small Groups

[2] Girvan, Newman - 2002 - Community structure in social and biological networks

[3] Santo Fortunato - 2016 - Community detection in networks: A User Guide

[4] Von Luxburg - 2006 - A tutorial on spectral clustering

[5] Ball, Karrer, Newman - 2011 - Efficient and principled method for detecting communities in networks

[6] Newman, Girvan - 2004 - Finding and evaluating community structure in networks

[7] Pons, Latapy - 2005 - Computing Communities in Large Networks Using Random Walks

[8] Fienberg, Wasserman - 1981 - Categorical Data Analysis of Single Sociometric Relations

[9] S.L. Van Der Pas, A.W. Van Der Vaart - 2016 - Bayesian Community Detection

[10] Holland, Laskey, Leinhardt - 1983 - Stochastic blockmodels: First steps

[11] Snijders, Nowicki - 1997 - Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure

[12] Donath and Hoffman - 1973 - Lower Bounds for the Partitioning of Graphs

[13] Fiedler - 1973 - Algebraic connectivity of graphs

[14] Shi, Malik - 2000 - Normalized cuts and image segmentation

[15] Ng, Jordan, Weiss - 2002 - On spectral Clustering : Analysis and an algorithm

[16] MacQueen - 1967 - Some methods for classification and analysis for multivariate observation

[17] Verma, Meila - 2003 - A Comparison of Spectral Clustering Algorithms

[18] Luxburg David, Pál - 2006 - A Sober Look at Clustering Stability

[19] Mohar - 1997 - Some applications of Laplace eigenvalues of graphs

[20] Von Luxburg , Bousquet , Belkin - 2004 - On the Convergence of Spectral Clustering on Random Samples: the Normalized Case

[21] Girvan, Newman - 2004 - Finding and evaluating community structure in networks

[22] Kemp, Griffiths, Tenenbaum - 2004 - Discovering Latent Classes in relational data

[23] Saldaña, Yu, Feng - 2015 - How Many Communities Are There?

[24] Hu, Qin, Yan, Zhao - 2019 - Corrected Bayesian information criterion for stochastic block models

[25] Wang, Bickel -2017 - Likelihood-based model selection for stochastic block models

[26] Xu, Ke, Wang, Cheng, Cheng - 2012 - A model-based approach to attributed graph clustering

[27] MacKenzie, Hines - 2018 - Basic Presence/Absence Situation

[28] Biernacki, Celeux, Govaert - 2010 - Exact and Monte Carlo calculations of integrated likelihoods for the latent class model

[29] Daudin , Picard, Robin - 2008 - A mixture model for random graph

[30] Latouche, Birmele, Ambroise - 2010 - Variational Bayesian inference and complexity control for stochastic block models

[31] McDaid, Murphy, Friel, Hurley - 2013 - Improved Bayesian inference for the stochastic block model with application to large networks

[32] Geng, Bhattacharya, Pati - 2019 - Probabilistic Community Detection With Unknown Number of Communities

[33] Nowicki, Snijders - 2001 - Estimation and Prediction for Stochastic Blockstructures

[34] Sun, Kaur, Milojević, Flammini, Menczer - 2013 - Social Dynamics of Science